Part-of-Speech Tagging of Northern Sotho: Disambiguating Polysemous Function Words

Authors

  • Gertrud Faaß
  • Ulrich Heid
  • Elsabé Taljard
  • Danie Prinsloo
Abstract

A major obstacle to part-of-speech (POS) tagging of Northern Sotho (Bantu, S 32) is its ambiguous function words. Many are highly polysemous and very frequent in texts, and their local context is not always distinctive. With certain taggers, this leads to comparatively poor results (between 88 % and 92 % accuracy), especially when sizeable tagsets (over 100 tags) are used. We use the RF-tagger (Schmid and Laws, 2008), which is designed specifically for annotation with fine-grained tagsets (e.g. those including agreement information), and we restructure the 141 tags of the tagset proposed by Taljard et al. (2008) to fit the RF-tagger. This leads to over 94 % accuracy. In addition, error analysis shows which types of phenomena cause trouble in the POS tagging of Northern Sotho.
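The idea of restructuring a flat tagset so that a tagger can treat the word class and the agreement information as separate layers can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a flat tag consists of an alphabetic word-class prefix followed by an optional noun-class number, and that the layered form joins the parts with a dot. The tag names (`CS01`, `PART`) and the helper functions are hypothetical and are not taken from the tagset of Taljard et al. (2008) or from the authors' actual restructuring.

```python
import re

# Hypothetical flat tag shape: letters (word class) + optional digits (noun class).
# This pattern is an assumption for illustration, not the paper's tagset definition.
FLAT_TAG = re.compile(r"^([A-Za-z_]+)(\d*)$")


def to_layered(flat_tag: str) -> str:
    """Rewrite a flat tag such as 'CS01' as a dot-separated layered tag 'CS.01'."""
    match = FLAT_TAG.match(flat_tag)
    if match is None:
        raise ValueError(f"unexpected tag format: {flat_tag!r}")
    category, noun_class = match.groups()
    # Tags without agreement information keep only the coarse category.
    return f"{category}.{noun_class}" if noun_class else category


def coarse_category(layered_tag: str) -> str:
    """Return the first layer (the coarse word class) of a layered tag."""
    return layered_tag.split(".")[0]


if __name__ == "__main__":
    for tag in ["CS01", "N09", "PART"]:
        layered = to_layered(tag)
        print(tag, "->", layered, "| coarse:", coarse_category(layered))
```

Under this layout, two tags that differ only in noun class share their first layer, so a model can in principle back off to the coarse word class when the full fine-grained tag is sparse in the training data.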

Similar resources

Grammar-based tools for the creation of tagging resources for an unresourced language: the case of Northern Sotho

We describe an architecture for the parallel construction of a tagger lexicon and an annotated reference corpus for the part-of-speech tagging of Northern Sotho, a Bantu language of South Africa, for which no tagged resources have been available so far. Our tools make use of grammatical properties (morphological and syntactic) of the language. We use symbolic pretagging, followed by stochastic t...

Development of Prototype Text-to-Speech Systems for Northern Sotho

Two text-to-speech synthesis systems were developed for one of the eleven official languages of South Africa, viz. Northern Sotho. A diphone synthesis system, based on extraction of diphones from nonsense words, was constructed. A cluster unit selection synthesis system, based on recordings of sentences containing a selection of most common words in Northern Sotho, was also built. The Festival ...

Part-of-Speech Tagging of Persian Using a Fuzzy Network Model

Part-of-speech tagging (POS tagging) is an ongoing research topic in natural language processing (NLP) applications. The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The purpose of POS tagging is determining the grammatical ...

Combining Independent Knowledge Sources for Word Sense Disambiguation

Yorick Wilks and Mark Stevenson, Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK. Sense tagging, the automatic assignment of the appropriate sense from some lexicon to each of the words in a text, is a specialised instance of the general problem of word sense disambiguation. We di...

Part-of-Speech Tagging Using a Variable Memory Markov

We present a new approach to disambiguating syntactically ambiguous words in context, based on Variable Memory Markov (VMM) models. In contrast to fixed-length Markov models, which predict based on fixed-length histories, variable memory Markov models dynamically adapt their history length based on the training data, and hence may use fewer parameters. In a test of a VMM-based tagger on the Brown c...

Publication date: 2009